1. INTRODUCTION

In a tone recognition system for musical instruments, there are two main parts. The first part extracts
the characteristics of the tone and is called the feature extraction part. The second part classifies the
results of the feature extraction part and is called the classification part. A transform domain approach
can be used to implement the feature extraction part. The discrete cosine transform (DCT) and the discrete
Fourier transform (DFT) are two transformation methods for converting tone signals from the time domain to
the transform domain. There are two approaches to feature extraction in the transform domain. The first is
feature extraction based on fundamental signals [1]-[6]. The second is feature extraction that is not based
on fundamental signals [7]-[12].

One approach to implementing the classification part of the tone recognition system is a statistical
approach. The support vector machine (SVM) is one example of a classifier that uses a statistical approach.
SVM is a classification method that originates from statistical learning theory [13], [14]. Initially, SVM
was used only for two-class classification. In subsequent developments, SVM was extended to multiclass
classification [15]-[21]. Previous research on tone recognition systems has focused primarily on tones with
multiple major (significant) local peaks in the transform domain [9]-[11]. Tone recognition that covers tones
with one, several, or many major local peaks in the transform domain has received very little attention.
Previous research [12] proposed combining DFT-based segment averaging for feature extraction and template
matching for classification in a tone recognition system. That system could recognize tones with one,
several, or many major local peaks in the transform domain. However, to identify a tone, that system needs
at least 16 feature extraction coefficients. Thus, there is still room to reduce the number of feature
extraction coefficients below 16. The advantage of using fewer feature extraction coefficients is that there
is less data to process.

This research combines feature extraction and classification methods for musical instrument tone
recognition. More specifically, it presents DCT-based feature extraction and SVM classification for musical
instrument tone recognition. First, the feature extraction method does not use fundamental signals in the
transform domain. Second, the tones used in this research were bellyra, clarinet, and pianica tones, which
represent tones with one, several, and many major local peaks in the DCT transformation domain, respectively.
The DCT transformation domain representations of the bellyra, clarinet, and pianica tones are shown in
Figure 1.
Figure 1. Tone C in the normalized DCT transformation domain representation |Y(k)| with a sampling rate of
5000 Hz and a 128-point DCT: (a) bellyra, (b) clarinet, (c) pianica


2. RESEARCH METHOD 
2.1. Materials preparation 

A tone signal is used as the tone recognition system's input. The tone signal is an isolated signal that
is stored in the waveform audio file format (WAV). The tone signals were acquired from three musical
instruments, namely bellyra, clarinet, and pianica. In this research, we recorded eight tones for each musical
instrument, namely C, D, E, F, G, A, B, and C'. We recorded the tone signals at a sampling rate of 5000 Hz.
This sampling rate satisfies the Shannon sampling theorem [22]:


$f_s \geq 2 f_{max}$ (1)


with $f_{max}$ being the highest frequency component of the tone signals and $f_s$ being the sampling rate. The
highest frequency components of the tone C' for bellyra, clarinet, and pianica, according to our visual
observations using Octave software, were 2097 Hz, 1406 Hz, and 1584 Hz, respectively. Also based on our visual
observations, recording a tone for 2 seconds was adequate to acquire the steady-state part of the tone
signal. It should be noted that the recorded tone signal can be divided into three parts, namely the silence,
transition, and steady-state parts. Accurate tone information is found only in the steady-state part.
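
As a quick arithmetic check of condition (1), the following minimal Python sketch compares the 5000 Hz sampling rate with twice each of the highest frequency components reported above. The function and variable names are illustrative and are not taken from the original system, which was developed in Octave.

```python
# Highest frequency components (Hz) of tone C' reported in the text.
F_MAX_HZ = {"bellyra": 2097, "clarinet": 1406, "pianica": 1584}
SAMPLING_RATE_HZ = 5000


def satisfies_sampling_theorem(fs_hz, fmax_hz):
    """Return True when fs >= 2 * fmax, i.e. when condition (1) holds."""
    return fs_hz >= 2 * fmax_hz


for name, fmax in F_MAX_HZ.items():
    # Print the instrument, the minimum required rate, and the check result.
    print(name, 2 * fmax, satisfies_sampling_theorem(SAMPLING_RATE_HZ, fmax))
```

The most demanding case is the bellyra, which requires at least 2 x 2097 = 4194 Hz, so 5000 Hz is sufficient for all three instruments.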

Three musical instruments were used in this research to acquire the above-mentioned tone signals.
They were an Isuzu ZBL-27 bellyra, a Yamaha YCL-255 clarinet, and a Yamaha P-37D pianica, as shown in
Figure 2. The tone signals were captured using an AKG Perception 120 USB microphone.





Figure 2. Three musical instruments: (a) bellyra, (b) clarinet, and (c) pianica
2.2. System design

The entire tone recognition system in this research is shown as a block diagram in Figure 3. The
system's input is a WAV-formatted tone signal. The system's output is text that denotes the recognized
tone. We used Octave software to develop the system. The blocks in Figure 3 are explained in the following
subsections.
Figure 3. The entire system is shown as a block diagram








2.2.1. Initial cutting 

Initial cutting deletes the silence and transition parts of a tone signal. The silence and
transition parts need to be deleted because they contain no accurate tone information. Accurate
tone information can only be obtained from the steady-state part of the tone signal. Based on our visual
observations, the silence part could be deleted first by using an amplitude threshold value of |0.5|
relative to the tone signal's highest value. Beginning with the leftmost part of the tone signal, the signal
was deleted as long as its amplitude was less than |0.5| relative to the highest value of the tone signal.
Following the deletion of the silence part, the transition part was deleted. Also based on our observations,
the transition part was deleted by removing 300 milliseconds from the left side of the remaining tone signal.
Finally, the steady-state part was obtained after the silence and transition parts were deleted.
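
The initial cutting described above can be sketched in Python as follows. This is a minimal illustration, not the original Octave implementation. It assumes that the threshold means 0.5 times the largest absolute sample value, and the function and parameter names are hypothetical.

```python
def initial_cutting(signal, fs_hz, threshold_ratio=0.5, transition_ms=300):
    """Remove the leading silence and transition parts of a tone signal.

    Assumption: the threshold is threshold_ratio times the largest
    absolute sample value of the whole signal.
    """
    peak = max(abs(x) for x in signal)
    limit = threshold_ratio * peak
    # Silence part: skip samples from the left until one reaches the threshold.
    start = next(i for i, x in enumerate(signal) if abs(x) >= limit)
    # Transition part: skip a fixed duration after the silence part.
    start += int(fs_hz * transition_ms / 1000)
    return signal[start:]


# Toy signal: 10 silent samples, a short rise, then a steady part.
toy = [0.0] * 10 + [0.2, 0.6, 1.0] + [0.9] * 20
steady = initial_cutting(toy, fs_hz=10)  # 300 ms at 10 Hz is 3 samples
print(len(steady))
```

At the 5000 Hz sampling rate used in this research, the 300 millisecond transition part corresponds to 1500 samples.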


2.2.2. Frame blocking 

Frame blocking cuts a signal frame from a long signal [23]. The objective of frame blocking is to
decrease the amount of signal data, which speeds up the tone recognition system's computation. Frame
lengths of $2^n$ points, where n is a positive integer, were evaluated in this research to find the shortest
frame length that gave the highest recognition rate.
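
A minimal Python sketch of frame blocking is given below. The text does not state where in the steady-state part the frame is taken, so taking it from the beginning is an assumption made here for illustration, and the function name is hypothetical.

```python
def frame_blocking(signal, n):
    """Cut a frame of 2**n samples from the start of a signal.

    Assumption: the frame starts at the first sample of the input.
    """
    frame_length = 2 ** n
    if len(signal) < frame_length:
        raise ValueError("signal is shorter than the requested frame")
    return signal[:frame_length]


# A 256-point frame (n = 8) cut from a longer toy signal.
frame = frame_blocking(list(range(1000)), 8)
print(len(frame))
```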


2.2.3. Normalization 

Setting the highest value of a data signal to 1 or -1 is known as normalization. The objective of this 
normalization is to decrease the disparity between a data signal's highest value and the others. Normalization 
is implemented using (2). 


$Y_{out} = Y_{in} / \max(|Y_{in}|)$ (2)

where $Y_{in}$ and $Y_{out}$ are the input and output data signal vectors, respectively.
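
Equation (2) can be illustrated with a short Python sketch. The handling of an all-zero frame is an assumption added here to avoid division by zero; it is not discussed in the text.

```python
def normalize(frame):
    """Divide every sample by the largest absolute value, as in (2)."""
    peak = max(abs(x) for x in frame)
    if peak == 0:
        # Assumption: an all-zero frame is returned unchanged.
        return list(frame)
    return [x / peak for x in frame]


# The sample with the largest magnitude becomes -1 here.
print(normalize([1.0, -4.0, 2.0]))
```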


2.2.4. Windowing 

Windowing smooths the discontinuities at the edges of a data signal [23]. These discontinuities arise
because the data signal was cut in the preceding frame blocking. Discontinuities give rise to extra
signals, known as harmonic signals, that are visible in the transformed data signal. The visibility of
harmonic signals can be reduced by smoothing the discontinuities. The Hamming window, which is extensively
used in signal processing [24], was used in this research. The window length was $2^n$, where n is a positive
integer. This window length is the same as the frame blocking length mentioned above.
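
As an illustration of this step, the sketch below applies the standard symmetric Hamming window, w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)), to a frame. The text does not give the window formula, so the use of this common definition is an assumption, and the function names are hypothetical.

```python
import math


def hamming_window(length):
    """Return the symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(frame):
    """Multiply a frame sample by sample with a Hamming window."""
    window = hamming_window(len(frame))
    return [x * w for x, w in zip(frame, window)]


# The window tapers both edges of a constant frame toward 0.08.
windowed = apply_window([1.0] * 64)
print(round(windowed[0], 2), round(windowed[-1], 2))
```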


2.2.5. DCT 


The DCT converts data signals from the time domain to the transform domain, known as the DCT
transformation domain. A DCT length of $2^n$ was used in this research, where n is a positive integer. The
DCT length is the same as the frame blocking length and the Hamming window length mentioned above.
In addition, this research took the absolute values of the DCT results because the next
process, logarithmic scaling, cannot be applied to negative values.
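
The following Python sketch computes the absolute values of a DCT directly from its definition. The text does not specify the DCT type or scaling, so the orthonormal DCT-II used here is an assumption. The direct O(N^2) sum is chosen for clarity rather than speed.

```python
import math


def abs_dct(frame):
    """Return |Y(k)| for an orthonormal DCT-II of the frame.

    Assumption: DCT-II with orthonormal scaling.
    """
    n_points = len(frame)
    result = []
    for k in range(n_points):
        total = sum(x * math.cos(math.pi * (2 * n + 1) * k / (2 * n_points))
                    for n, x in enumerate(frame))
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n_points)
        # The absolute value keeps the next (logarithmic) step well defined.
        result.append(abs(scale * total))
    return result


# A frame equal to the k = 3 basis function gives a single peak at k = 3.
basis = [math.cos(math.pi * (2 * n + 1) * 3 / 16) for n in range(8)]
spectrum = abs_dct(basis)
print(spectrum.index(max(spectrum)))
```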


2.2.6. Logarithmic scaling 

The gap in peak levels in a data signal can be reduced by logarithmic scaling. The logarithmic 
scaling results indicate an increase in the number of major local peaks. Previous research [11], [12] shows 
that feature extraction using segment averaging (which is used in this research) produces superior results for 
data signals with many major local peaks. The following is a mathematical expression of logarithmic scaling. 


$Y_{out}(k) = \log(Y_{in}(k) + 1)$ (3)


where $Y_{out}(k) = \{Y_{out}(0), Y_{out}(1), \ldots, Y_{out}(N-1)\}$ and $Y_{in}(k) = \{Y_{in}(0), Y_{in}(1), \ldots, Y_{in}(N-1)\}$
are the data signal's output and input, respectively, and N is the length of the data signal. The inclusion of
the value 1 in (3) prevents the logarithm from approaching negative infinity when the input data signal has a
value close to zero.
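
Equation (3) can be sketched in Python as follows. The base of the logarithm is not stated in the text, so the natural logarithm is an assumption here. The input is expected to be non-negative, as it is after the absolute value step.

```python
import math


def log_scale(frame):
    """Apply Y_out(k) = log(Y_in(k) + 1) to every sample of the frame.

    Assumption: natural logarithm; inputs are non-negative.
    """
    return [math.log(y + 1.0) for y in frame]


# A zero input maps to zero instead of negative infinity.
print(log_scale([0.0, 1.0, 9.0]))
```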


2.2.7. Frame warping 
Frame warping reduces the length of a data frame. The results of frame warping show a denser
distribution of data. Basically, frame warping is carried out by dividing the data frame into two halves and
then combining them. The frame warping algorithm is as follows:

Frame warping algorithm:
1. Consider an input data frame $Y_{in}(k) = \{Y_{in}(0), Y_{in}(1), \ldots, Y_{in}(N-1)\}$, where N is an even number
that represents the length of the input data frame.
2. Divide $Y_{in}(k)$ into two segments $V_1(k) = \{Y_{in}(0), Y_{in}(1), \ldots, Y_{in}((N/2)-1)\}$ and
$V_2(k) = \{Y_{in}(N/2), Y_{in}((N/2)+1), \ldots, Y_{in}(N-1)\}$.
3. Flip $V_2(k)$ to obtain $V_{2f}(k) = \{Y_{in}(N-1), \ldots, Y_{in}((N/2)+1), Y_{in}(N/2)\}$.
4. Merge $V_1(k)$ and $V_{2f}(k)$ into an output data frame
$Y_{out}(k) = \{Y_{out}(0), Y_{out}(1), \ldots, Y_{out}((N/2)-1)\}$ as:


$Y_{out}(k) = [V_1(k) + V_{2f}(k)]/2$ (4)

As a note, this frame warping will give an output data frame that is half the length of the input data frame.
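
The four steps above can be sketched in Python as follows. The function name is illustrative.

```python
def frame_warping(frame):
    """Fold a frame of even length N into a frame of length N/2, as in (4)."""
    n = len(frame)
    if n % 2 != 0:
        raise ValueError("frame length must be even")
    half = n // 2
    v1 = frame[:half]          # first half, V1(k)
    v2f = frame[half:][::-1]   # second half flipped, V2f(k)
    return [(a + b) / 2 for a, b in zip(v1, v2f)]


# The first output sample averages the first and last input samples.
print(frame_warping([0, 2, 4, 8]))
```

For the toy input, the two halves are {0, 2} and {4, 8}, the flipped second half is {8, 4}, and the output is {4.0, 3.0}.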


2.2.8. Segment averaging 
One way to reduce the amount of data in a data frame is to use segment averaging. This research
used the segment averaging from previous research [11], [12]. Basically, segment averaging is carried
out by dividing a data frame into a number of data segments and then averaging the data in
each segment. The segment averaging algorithm is as follows:

Segment averaging algorithm:
1. Suppose there is an input data frame $Y_{in}(k) = \{Y_{in}(0), Y_{in}(1), \ldots, Y_{in}(M-1)\}$, with the frame
length $M = 2^q$ and $q \geq 0$.
2. Determine the segment length L, where $L = 2^p$ for $0 \leq p \leq q$.
3. Cut the input data frame $Y_{in}(k)$ into segments of length L. This gives a number of segments Z:


$Z = M/L$ (5)


and a data sequence $S_v(u) = \{S_v(0), S_v(1), \ldots, S_v(L-1)\}$ in each segment, where $0 \leq v \leq Z-1$.
4. Calculate the output data frame $Y_{out}(v) = \{Y_{out}(0), Y_{out}(1), \ldots, Y_{out}(Z-1)\}$ by averaging
each segment:


$Y_{out}(v) = \frac{1}{L} \sum_{u=0}^{L-1} S_v(u)$ (6)


The segment length L was evaluated in this research at the values $1, 2, 4, \ldots, M$, where M is
the length of the frame resulting from the frame warping. Note that (5) can be stated as follows if N is
the length of the input data frame in the frame warping:


$Z = N/(2L)$ (7)
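
Segment averaging can be sketched in Python as follows. The sketch only requires that the segment length divides the frame length, which is slightly more general than the powers of two used in this research, and the function name is illustrative.

```python
def segment_averaging(frame, segment_length):
    """Average consecutive segments of a frame, giving Z = M / L values."""
    m = len(frame)
    if segment_length <= 0 or m % segment_length != 0:
        raise ValueError("segment length must divide the frame length")
    return [sum(frame[i:i + segment_length]) / segment_length
            for i in range(0, m, segment_length)]


# M = 8 and L = 2 give Z = 4 feature extraction coefficients.
print(segment_averaging([1, 3, 5, 7, 2, 4, 6, 8], 2))
```

As an example of (7), a frame blocking length of N = 256 gives M = 128 after frame warping, so a segment length of L = 16 gives Z = 256/(2 x 16) = 8 feature extraction coefficients.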
2.2.9. SVM classification

Classification determines the pattern class of a data frame. SVM is a method that can be used for
this classification. SVM is a linear classifier. Training an SVM searches for the best hyperplane that
separates two data sets that come from two different pattern classes. Mathematically, this
hyperplane is a linear discriminant function.

In the real world, a hyperplane cannot always separate the two data sets from two different pattern
classes. In other words, the two data sets may not be linearly separable. In that case, a data
transformation needs to be carried out. A function called a kernel function [25] can be used to carry out
this transformation. Linear and polynomial kernel functions are two commonly used kernel functions.
This research evaluated these two kernel functions.
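
To make the two kernel functions concrete, the following Python sketch evaluates a linear kernel and a polynomial kernel of the common form (gamma * x.y + coef0)^degree on two small vectors. The parameter values shown are illustrative assumptions only. They are not the settings used in this research, which relied on the library defaults.

```python
def linear_kernel(x, y):
    """Linear kernel: the dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, gamma=1.0, coef0=0.0):
    """Polynomial kernel: (gamma * x.y + coef0) ** degree.

    Assumption: the gamma and coef0 values are illustrative only.
    """
    return (gamma * linear_kernel(x, y) + coef0) ** degree


x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y))                # dot product
print(polynomial_kernel(x, y, degree=2))  # second order
print(polynomial_kernel(x, y, degree=3))  # third order
```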

Initially, SVM was developed for two-class pattern classification. Subsequently, SVM
was developed for multiclass pattern classification. This research is a case of multiclass pattern
classification. For this multiclass case, the one-vs-all (OVA) Tree Multiclass method [20] was used. OVA
was selected because its performance is comparable to that of other methods, particularly
[16] and [21]. In this research, we used LibSVM with its default settings [26].


2.3. Feature extraction and SVM training 

The SVM shown in Figure 3 needs training data. This data is obtained
using the feature extraction proposed in this research, shown in Figure 4. The input is a WAV-formatted tone
signal. The output is the feature extraction of the input tone signal. Note that every block in the proposed
feature extraction is a standard one. However, the complete series of blocks in Figure 4 is unique, and this
is the novelty of this research. For each musical instrument (bellyra, clarinet, or pianica), 10 samples were
recorded for each tone (C, D, E, F, G, A, B, or C'). So there were a total of 240 tone signals. Each of
these tone signals was processed by the feature extraction shown in Figure 4, and SVM training used the
results of this feature extraction.


Figure 4. The proposed feature extraction is shown as a block diagram


2.4. Test tones and recognition rate 

The test tones are the tone signals used to test the performance of the tone recognition system.
For each tone (C, D, E, F, G, A, B, or C') of each musical instrument (bellyra, clarinet, or
pianica), 20 samples were recorded as test tones. So, for each musical instrument, there was
a total of 160 test tone signals. The recognition rate measures the performance of the tone recognition
system. The recognition rate is calculated as follows.


$\text{Recognition rate} = \frac{\text{Number of correctly recognized tones}}{\text{Test tones}} \times 100\%$ (8)
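
As a worked example of (8), the Python sketch below computes the recognition rate from lists of recognized and true tone labels. The function name and the toy labels are hypothetical.

```python
def recognition_rate(recognized, expected):
    """Return the percentage of test tones that were recognized correctly."""
    if not expected or len(recognized) != len(expected):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(r == e for r, e in zip(recognized, expected))
    return 100.0 * correct / len(expected)


# 160 test tones with two misrecognized tones.
expected = ["C"] * 160
recognized = ["C"] * 158 + ["D"] * 2
print(recognition_rate(recognized, expected))
```

With 160 test tones per instrument, each misrecognized tone lowers the recognition rate by 0.625 percentage points, so two errors give 98.75%.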


3. RESULTS AND ANALYSIS 
3.1. Test results 

Table 1 presents the test results of the developed tone recognition system using a linear kernel function
in SVM classification, for different combinations of the frame blocking length and the number of feature
extraction coefficients.
Table 1. Test results for the use of a linear kernel function in SVM classification, for different combinations
of the lengths of frame blocking and the number of feature extraction coefficients. Results shown: recognition
rate (%)
The length of frame Number of feature extraction coefficients (coefficients)
blocking (points) 1 2 4 8 16 32 64 128 256
(a) Musical instrument: Bellyra 
64 16.88 27.50 78.75 98.75 100 100 - - - 
128 13.75 36.88 80.00 100 100 100 100 - - 
256 12.50 37.50 81.88 100 100 100 100 100 - 
512 15.63 38.13 76.88 100 100 100 100 100 100 
(b) Musical instrument: Clarinet 
64 19.38 27.50 69.38 93.75 100 100 - - - 
128 22.50 25.63 78.75 97.50 100 100 100 - - 
256 16.88 28.75 65.63 100 100 100 100 100 - 
512 12.50 22.40 71.88 98.75 100 100 100 100 100 
(c) Musical instrument: Pianica 
64 30.63 51.25 93.75 100 100 100 - - - 
128 24.38 41.25 91.88 98.75 100 100 100 - - 
256 20.00 35.63 81.25 100 100 100 100 100 - 
512 26.25 36.88 72.50 100 100 100 100 100 100 


As indicated in Table 1, the recognition rate increases as the number of feature extraction
coefficients increases. A larger number of feature extraction coefficients increases the
dimension of the feature extraction space. The increased dimension of the feature extraction space makes it
easier to differentiate one pattern class from the other pattern classes, which
ultimately increases the recognition rate.


3.2. The smallest number of feature extraction coefficients 

The goal of this research is to discover the smallest number of feature extraction coefficients that
can be used in a tone recognition system that can recognize tones with one,
several, or many major local peaks in the transform domain. As indicated in Table 1, the smallest
number of feature extraction coefficients, i.e. eight coefficients, together with the shortest frame blocking
length, i.e. 256 points, gives the highest recognition rate of 100% for all three instruments. In other words,
by using at least eight feature extraction coefficients, the tone recognition system can recognize all the
tested tones. The tested tones in this case are those with one, several, or many major local peaks in the
transform domain.

It should be noted that this smallest number (eight) of feature extraction coefficients is
linked to the use of a linear kernel function in SVM classification. For other kernel
functions (i.e. polynomial functions) in SVM classification, the tests in Table 1 were repeated for
second-order and third-order polynomial functions. The results obtained for the linear and both polynomial
functions are presented in Table 2.


Table 2. The results of the use of linear and polynomial kernel functions in SVM classification
Kernel function in SVM classification Number of feature extraction coefficients (coefficients) Highest recognition rate (%)
Linear 8 100 
Second order polynomial 8 98.75 
Third order polynomial 16 98.75 


As indicated in Table 2, the linear kernel function gives the best results. Using only eight
feature extraction coefficients (the smallest number of coefficients), it gives a recognition rate of
100%. This indicates that, with eight feature extraction coefficients, the pattern classes of the feature
extraction of the tone signals are linearly separable.


3.3. Comparison of some feature extraction and classification combinations 

The performance of some feature extraction and classification combinations for musical instrument
tone recognition is compared in Table 3. As indicated in Table 3, the feature extraction and classification
combination proposed in this research is the most efficient for use in a musical instrument tone recognition
system. This is because the tone recognition system needs only eight feature extraction coefficients (the
smallest number of coefficients) to recognize tones with one, several, or many major local peaks in the
transform domain.
Table 3. The best performance comparison of some feature extraction and classification combinations for
musical instrument tone recognition. Results shown: smallest number of feature extraction coefficients
(musical instrument), by number of major local peaks in the transform domain
Feature extraction/classification | One | Several | Many | Highest recognition rate (%)
MFCC/K-NN [9] | - | 22 (flute) | 22 (cello, piano, trumpet, violin) | 91.66
Spectral Features/SVM [10] | - | 21 (gamelan) | - | 98.93
DCT based segment averaging/template matching [11] | [illegible] | 16 (soprano recorder) | [illegible] (pianica) | [illegible]
DFT based segment averaging/template matching [12] | [illegible] | 16 (tenor recorder) | [illegible] (pianica) | [illegible]
DCT based segment averaging/SVM (this research) | 8 (bellyra) | 8 (clarinet) | 8 (pianica) | 100


4. CONCLUSION 

This research proposes a feature extraction and classification combination for a tone
recognition system for musical instruments. The purpose of this combination is to obtain a tone
recognition system that uses the smallest number of feature extraction coefficients in the recognition
process. To do this, we combined DCT-based feature extraction and SVM classification
in the tone recognition system. The test results show that the proposed feature extraction
and classification combination makes the tone recognition system efficient: the system needed only
eight feature extraction coefficients to recognize tones with one,
several, or many major local peaks in the transform domain. We also found that SVM
classification requires only a linear kernel function. This indicates that the pattern classes from the
DCT-based segment averaging feature extraction are linearly separable. For further development of this
research, we recommend exploring other feature extraction and classification combinations. These
combinations can use different methods for feature extraction (other than DCT-based segment averaging) and
different methods for classification (other than SVM).